{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/nceder/qpb4e/blob/main/code/Chapter%2020/Chapter_20.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 20 Basic file wrangling"
      ],
      "metadata": {
        "id": "6qxyG0sHdLeY"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 20.2 Scenario: The product feed from hell"
      ],
      "metadata": {
        "id": "uVs8Rx9zjH3L"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Quick Check : Consider the choices\n",
        "\n",
        "What are your options for handling the tasks I've identified? What modules in the standard library can you think of that will do the job? If you want, you can even stop right now and work out the code to do it. Then compare your solution with the one you develop later.\n",
        "\n",
        "#### Discussion\n",
        "\n",
        "From the standard library, use datetime for managing the dates/times of the files, and either os.path and os or pathlib for renaming and archiving the files.\n"
      ],
      "metadata": {
        "id": "oLfV3BO6WkDS"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# run this cell first to create files\n",
        "\n",
        "! touch item_info.txt item_attributes.txt related_items.txt"
      ],
      "metadata": {
        "id": "meqnci66d9bM"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yT1uhmvgdI3t",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "78c791c6-3a75-4471-ef9a-2906b24e5de5"
      },
      "source": [
        "import pathlib\n",
        "cur_path = pathlib.Path(\".\")\n",
        "FILE_PATTERN = \"*.txt\"\n",
        "path_list = cur_path.glob(FILE_PATTERN)\n",
        "print(list(path_list))"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[PosixPath('related_items.txt'), PosixPath('item_info.txt'), PosixPath('item_attributes.txt')]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Listing 20.1 File files_01.py\n"
      ],
      "metadata": {
        "id": "VwNrUdWRjowv"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "R3GCOVyudI3u"
      },
      "source": [
        "# Listing 20.1 File files_01.py\n",
        "\n",
        "import datetime\n",
        "import pathlib\n",
        "\n",
        "FILE_PATTERN = \"*.txt\"             #A\n",
        "ARCHIVE = \"archive\"\n",
        "\n",
        "def main():\n",
        "\n",
        "    date_string = datetime.date.today().strftime(\"%Y-%m-%d\")    #B\n",
        "\n",
        "    cur_path = pathlib.Path(\".\")\n",
        "    archive_path = cur_path.joinpath(ARCHIVE)\n",
        "    archive_path.mkdir(exist_ok=True)        #C\n",
        "\n",
        "    paths = cur_path.glob(FILE_PATTERN)\n",
        "\n",
        "    for path in paths:\n",
        "        new_filename = f\"{path.stem}_{date_string}{path.suffix}\"\n",
        "        new_path = archive_path.joinpath(new_filename)        #D\n",
        "        path.rename(new_path)                      #E\n",
        "\n",
        "if __name__ == '__main__':\n",
        "     main()"
      ],
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Quick Check: Potential Problems\n",
        "Because the preceding solution is very simple, there are likely to be many situations that it won’t handle well. What are some potential issues or problems that might arise with the example script? How might you remedy these problems?\n",
        "\n",
        "Consider the naming convention used for the files, which is based on the year, month and day, in that order. What advantages do you see in that convention? What might be the disadvantages? Can you make any arguments for putting the date string somewhere else in the filename, such as the beginning or the end?\n",
        "\n",
        "#### Discussion\n",
        "\n",
        "Multiple files during the same day would be a problem, for one thing. If you have lots of files, navigating the archive directory will become increasingly  difficult.\n",
        "\n",
        "Using year-month-day date formats makes a text-based sort of the files sort by date as well. Putting the date at the end of the filename but before the extension makes it more difficult to parse the date element visually."
      ],
      "metadata": {
        "id": "YZn5TObBXepx"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# 20.3 More organization"
      ],
      "metadata": {
        "id": "HmO4egz5jYyg"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Listing 20.2 File files_02.py"
      ],
      "metadata": {
        "id": "41SzjZxJjkGc"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# run this before running cell below\n",
        "# clear files from archive directory\n",
        "\n",
        "! rm archive/*\n",
        "! touch item_info.txt item_attributes.txt related_items.txt"
      ],
      "metadata": {
        "id": "_LXEcAAeIrzX"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tAHWMGcEdI3v"
      },
      "source": [
        "# Listing 20.2 File files_02.py\n",
        "\n",
        "import datetime\n",
        "import pathlib\n",
        "\n",
        "FILE_PATTERN = \"*.txt\"\n",
        "ARCHIVE = \"archive\"\n",
        "\n",
        "def main():\n",
        "\n",
        "    date_string = datetime.date.today().strftime(\"%Y-%m-%d\")\n",
        "\n",
        "    cur_path = pathlib.Path(\".\")\n",
        "\n",
        "    archive_path = cur_path.joinpath(ARCHIVE)\n",
        "    archive_path.mkdir(exist_ok=True)                             #A\n",
        "    new_path = archive_path.joinpath(date_string)\n",
        "    new_path.mkdir(exist_ok=True)            #B\n",
        "\n",
        "    paths = cur_path.glob(FILE_PATTERN)\n",
        "\n",
        "    for path in paths:\n",
        "        path.rename(new_path.joinpath(path.name))\n",
        "\n",
        "if __name__ == '__main__':\n",
        "     main()"
      ],
      "execution_count": 6,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Try This: Implementation of multiple directories\n",
        "\n",
        "How would you modify the code that you developed to archive each set of files in subdirectories named according to date received? Feel free to take the time to implement the code and test it."
      ],
      "metadata": {
        "id": "n5DWqlbgX9CI"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Quick Check: Alternate solutions\n",
        "How might you create a script that does the same thing without using pathlib? What libraries and functions would you use?\n",
        "\n",
        "#### Discussion\n",
        "You'd use the os.path and os libraries—specifically, `os.path.join()`, `os.mkdir()`, and `os.rename()`."
      ],
      "metadata": {
        "id": "VtwZ7kZJYtMU"
      }
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "Z5hOgeveYFYn"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# @title\n",
        "import datetime\n",
        "import pathlib\n",
        "\n",
        "FILE_PATTERN = \"*.txt\"\n",
        "ARCHIVE = \"archive\"\n",
        "\n",
        "if __name__ == '__main__':\n",
        "\n",
        "    date_string = datetime.date.today().strftime(\"%Y-%m-%d\")\n",
        "\n",
        "    cur_path = pathlib.Path(\".\")\n",
        "\n",
        "    new_path = cur_path.joinpath(ARCHIVE, date_string)\n",
        "    new_path.mkdir()\n",
        "\n",
        "    paths = cur_path.glob(FILE_PATTERN)\n",
        "\n",
        "    for path in paths:\n",
        "        path.rename(new_path.joinpath(path.name))"
      ],
      "metadata": {
        "cellView": "form",
        "id": "WHM6kPLmYGB3"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 20.4.1 Compressing files"
      ],
      "metadata": {
        "id": "hklAeSftjxwD"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Listing 20.3 File files_03.py"
      ],
      "metadata": {
        "id": "uGJnCxqsj5Jm"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Try This: Archiving to zip files pseudocode\n",
        "\n",
        "Write the pseudocode for a solution that stores data files in zip files. What modules and functions or methods do you intend to use? Try coding your solution to make sure that it works.\n",
        "\n",
        "#### Discussion\n",
        "Pseudocode:\n",
        "```\n",
        "create path for zip file\n",
        "create empty zipfile\n",
        "for each file\n",
        "    write into zipfile\n",
        "    remove original file\n",
        "```\n",
        "(See the next section for sample code that does this.)"
      ],
      "metadata": {
        "id": "Zu2yMayKYUZx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# run this before running cell below\n",
        "# clear files from archive directory\n",
        "\n",
        "! rm -rfr archive/*\n",
        "! touch item_info.txt item_attributes.txt related_items.txt"
      ],
      "metadata": {
        "id": "qFGqQ9pLJAq3"
      },
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "DdMt_0AcdI3v"
      },
      "source": [
        "# Listing 20.3 File files_03.py\n",
        "\n",
        "import datetime\n",
        "import pathlib\n",
        "import zipfile          #A\n",
        "\n",
        "FILE_PATTERN = \"*.txt\"\n",
        "ARCHIVE = \"archive\"\n",
        "\n",
        "def main():\n",
        "\n",
        "    date_string = datetime.date.today().strftime(\"%Y-%m-%d\")\n",
        "\n",
        "    cur_path = pathlib.Path(\".\")\n",
        "    archive_path = cur_path.joinpath(ARCHIVE)\n",
        "    archive_path.mkdir(exist_ok=True)\n",
        "\n",
        "    paths = cur_path.glob(FILE_PATTERN)\n",
        "\n",
        "    zip_file_path = cur_path.joinpath(ARCHIVE, date_string + \".zip\")   #B\n",
        "    zip_file = zipfile.ZipFile(str(zip_file_path), \"w\")       #C\n",
        "\n",
        "    for path in paths:\n",
        "        zip_file.write(str(path))                                 #D\n",
        "        path.unlink()             #E\n",
        "\n",
        "if __name__ == '__main__':\n",
        "     main()"
      ],
      "execution_count": 8,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## 20.4.2 Grooming files"
      ],
      "metadata": {
        "id": "cQrO5qVCk_5v"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Listing 20.4 File files_04.py"
      ],
      "metadata": {
        "id": "CSiT0khclC0e"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# run this before running cell below\n",
        "# create zip files in archive directory\n",
        "from datetime import datetime, timedelta\n",
        "\n",
        "\n",
        "def populate_archive(zip_file_path, current_date):\n",
        "    for days in range(30, 40):\n",
        "        zip_date = current_date - timedelta(days=days)\n",
        "        new_zip_path = zip_file_path.joinpath(f\"{zip_date.strftime('%Y-%m-%d')}.zip\")\n",
        "        zip_file = new_zip_path.write_text(\"Test\")\n",
        "\n",
        "cur_path = pathlib.Path(\".\")\n",
        "zip_file_path = cur_path.joinpath(ARCHIVE)\n",
        "current_date = datetime.today()\n",
        "populate_archive(zip_file_path, current_date)"
      ],
      "metadata": {
        "id": "zfiyyAKJJLvN"
      },
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wQq4s_RqdI3v"
      },
      "source": [
        "# Listing 20.4 File files_04.py\n",
        "\n",
        "from datetime import datetime, timedelta\n",
        "import pathlib\n",
        "import zipfile\n",
        "\n",
        "FILE_PATTERN = \"*.zip\"\n",
        "ARCHIVE = \"archive\"\n",
        "ARCHIVE_WEEKDAY = 1\n",
        "def main():\n",
        "    cur_path = pathlib.Path(\".\")\n",
        "    zip_file_path = cur_path.joinpath(ARCHIVE)\n",
        "\n",
        "    paths = zip_file_path.glob(FILE_PATTERN)\n",
        "    current_date = datetime.today()    #A\n",
        "\n",
        "    for path in paths:\n",
        "        name = path.stem              #B\n",
        "        path_date = datetime.strptime(name, \"%Y-%m-%d\")     #C\n",
        "        path_timedelta = current_date - path_date          #D\n",
        "        if (path_timedelta > timedelta(days=30)\n",
        "                and path_date.weekday() != ARCHIVE_WEEKDAY):    #E\n",
        "            path.unlink()\n",
        "\n",
        "if __name__ == '__main__':\n",
        "     main()"
      ],
      "execution_count": 10,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Quick Check: Consider different parameters\n",
        "\n",
        "Take some time to consider different grooming options. How would you modify the code in the previous Try This to keep only one file a month? How would you change the code so that files from the previous month and older are groomed to save one a week? (Note: This is not the same as older than 30 days!)\n",
        "\n",
        "#### Discussion\n",
        "You could use something similar to the code above but also check the month of the file against the current month."
      ],
      "metadata": {
        "id": "tyG96WH3ZDnL"
      }
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "yd69dfGHZLZS"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}